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Abstract 

Word matches are often used in sequence comparison methods, 
either as a measure of sequence similarity or in the first search steps of 
algorithms such as BLAST or BLAT. The D2 statistic is the number of 
matches of words of k letters between two sequences. Recent advances 
have been made in the characterisation of this statistic and in the 
approximation of its distribution. Here, these results are extended to 
the case of approximate word matches. 

We compute the exact value of the variance of the D2 statistic 
for the case of a uniform letter distribution, and introduce a method 
to provide accurate approximations of the variance in the remaining 
cases. This enables the distribution of D2 to be approximated for 
typical situations arising in biological research. We apply these results 
to the identification of cis-regulatory modules, and show that this 
method detects such sequences with a high accuracy. 

The ability to approximate the distribution of D2 for both exact 
and approximate word matches will enable the use of this statistic in 
a more precise manner for sequence comparison, database searches, 
and identification of transcription factor binding sites. 

1 Introduction 

Alignment-free sequence comparison methods based on word matches allow se- 
quences to be compared without assuming contiguity of homologous segments. 
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This is of particular interest for the comparison of biological sequences, where dele- 
tions, insertions or duplications of segments are common. Several such methods 
have thus been implemented (see |12j . for example), and have had various appli- 
cations, such as the clustering of large EST databases (for example, These 
applications, however, typically rely on empirical thresholds, rather than on rigor- 
ous statistical theory. 

One of the statistics for alignment free sequence comparison that has received 
much attention is the D2 statistic, which measures the number of words shared 
between two sequences. The characterisation of this statistic started with the 
calculation of its mean, and with approximations to the variance [26] . Later, more 
accurate approximations of the variance allowed asymptotic regimes of D2 to be 
derived for non-uniform [T9] and uniform [15] letter distributions. More recently, 
the exact value of the D2 variance has been computed [IGUlOj. In parallel with this 
theoretical effort, optimal word sizes for typical biological situations were computed 
[1], and practical approximations of the distribution of D2 in these settings were 
proposed [10] . 

A more general version of the D2 statistic is the number of approximate word 
matches between two sequences. After an initial characterisation of the mean 
of this statistic, an asymptotic distribution regime was characterised when the 
logarithm of the sequence size is large compared with the word size p] . Here, we 
further characterise the D2 statistic in the case of approximate word matches, by 
computing its variance and proposing approximations of its distribution for typical 
biologically relevant situations. Finally, we present an application of these results 
to the identification of regulatory sequences. 

2 Results 
2.1 Definitions 

The statistic D2{nA,nB, k, t, rj) {D2 henceforth) is the number of approximate word 
matches of length k with up to t mismatches between sequences A = [Ai . . . A^^) 
and B = [Bi . . . Bn^ ) with Ai and Bj belonging to an alphabet A and distributed 
according to a letter distribution parameterised by rj. As previously [TU], for 
mathematical convenience we will impose periodic boundary conditions, that is, 
the letter in the first position in a sequence is assumed to follow the last letter 
of that sequence. Also, only the case of strand symmetric Bernoulli text will 
be considered, that is, sequences built from alphabets of four iid (independent 
and identically distributed) letters (A, T, G and C) with the further constraint 
that the probability of letter a € A occurring \s = = \{^ + v) ^-^^d 
£,G = ic = i(l ~ ^)) where < 77 < 1. Note that the periodic boundary conditions 
simplify the theoretical calculations considerably, but allow the method to be used 
for linear as well as circular sequences by appropriate preprocessing (see Section 
2.5 for example). 
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Defining the t neighbourhood match indicator 

^..^fl if A{{Ai,...,Ai+k-i),{Bj,...,Bj+k-i)) <t 
^'^'^^ \ otherwise ^ ' 

where A{wi,W2) is the number of mismatches between the words wi and W2, the 
D2 statistic is given by 

where the index set is / = : 1 < i < ua, 1 < J < ns}- 



2.2 D2 mean 

The mean of D2 was first computed for exact word matches (t = 0) and iid letters 
|26j . This was later extended to the case of letters generated by a Markov model 
|16]. A formula for the mean was also computed for approximate word matches 
{t > 0) in the case of Bernoulli symmetric text [3| in terms of the perturbed 
binomial distribution [2T]. In Appendix IA.2I we derive the equivalent formula 

^[^2] = ^E0)(3-^W + r/y-'. (3) 
1=0 ^ ^ 



2.3 D2 variance 

An exact formula for the variance of D2 was derived in the case of iid letters and 
exact word matches using periodic boundary conditions in [10]. Another study 
computed the variance for exact word matches using free boundary conditions, in 
the cases of iid letters and of letters generated by a Markov model [16] . Here we 
extend these results to the case of approximate word matches for iid letters and 
Bernoulli symmetric text, using periodic boundary conditions. Specific details of 
this technical derivation are given in Appendix IA.3i A brief summary is given 
below. 

To calculate the variance of D2 for approximate word matches and symmetric 
Bernoulli text, we follow the method used in [10]. First we deduce from equa- 
tion ([2]) that: 

Varp2)=Var I ^ Y^,A =^11 Cow {Y^,^),Y^,^,)). (4) 
\(ij)e/ / (i,i)G/(i',j')e^ 

We set u = v = and for fixed u split the sum over v as follows. Let 

Ju = {v = {i' tJ') ■ \i' — i\ < k or \j' — j\ < k} be the dependency neighbourhood 
of Yu- For V ^ Ju, Gov (Y^, Yy) = 0. Ju is decomposed into two disjoint sets [26] : 
an accordion set, Ju = {v = (i' ■ \i' — i\ < k and \j' — j\ < k} (when two pairs 
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Table 1: Contribution of the index sets of the dependency neighbourhood to 
the variance of D2. See text for definitions. 



of matching words overlap in both sequences) and a crabgrass set, = Ju\ 
(when two pairs of matching words overlap in one sequence only). The accordion 
set is further decomposed into a diagonal part, J^'^ = {v = {i' : —k < i' — i = 
j' — j < k} and an off-diagonal part, = Ju\ "^u^- 

Table[T]gives a summary of the components of the variance in different settings. 
The only case that is not analytically characterised is the off-diagonal part of the 
accordion for approximate word matches and non-uniform letter distribution. In 
this case, however, numerical tables can be assembled to approximate the entire 
accordion part of the variance with good accuracy. To see this, note that the 
accordion part takes the form nAnB^ik,t,r]). When ua = ns = 2k — 1, the only 
index set contributing to the variance is the accordion part. Although computing 
D2 for approximate word matches requires an algorithm with complexity o^riAnB), 
it is relatively inexpensive to approximate the variance of D2 by simulation for 
small riA and ub- Tables of the function $ were thus approximated by simulating 
a large number of pairs of sequences of length 2k — 1 for k < 16 and setting 
^{k,t,r]) = Var(L»2(2/c -1,2k- l,k,t,'n))/{2k - l)2(see below). 

2.4 D2 distribution 

It has been shown previously [lOj that for exact word matches and in most bi- 
ologically relevant situations, a distribution chosen ad-hoc such as the gamma 
distribution can provide a better estimate of the D2 distribution than the asymp- 
totic normal distribution. Here we provide approximations for the distribution of 
D2 in the case of approximate word matches. 

For convenience we have set ua = ^b = n in our numerical simulations. We 
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have simulated the distribution of D2 for sequence sizes ranging from small ESTs 
(n = 100) to reasonably large genes {n = 3200), for even word sizes k between 
2 and 16, for every possible number of mismatches (0 < t < /c), and for both 
uniform (r/ = 0) and non- uniform (?7 = g) letter distributions. For each combi- 
nation of parameters, 10^ pairs of iid sequences were generated. Tables of the 
accordion contribution function <1> were estimated by generating 10^ pairs of iid 
sequences of size n = 2k — 1, with k ranging from 2 to 16 with an increment 
of 2. The Mersenne-Twister random number generator [2U] was used, as im- 
plemented in the GNU scientific library (http://www.gnu.org/software/gsl/). 
The code was written in ANSI C and is available from the authors' website 



(http : //wwwmaths . anu.edu.au/cbis/~sf/k_words). 



Previously, the gamma distribution was used to approximate the D2 distribu- 
tion in the case of exact word matches Here, the beta distribution scaled 
to the range [0, n^] is used instead of the gamma distribution. In the range of 
parameters assessed in our simulations, the gamma and beta distributions are 
mostly indistinguishable (data not shown). It might be expected, however, that 
the beta distribution provides better approximations for very small p-values, as it 
is bounded within the same domain of definition as D2 ([0, n^n^]), whereas the 
gamma distribution is defined from zero to infinity. Histograms of our numerical 
simulations the D2 statistic are compared with the density function of the beta 
distribution scaled to this interval, that is 

-^fB(^;a,^) (5) 
UAnB ynAUB ) 

where fB{x;a,f3) = T{a + /3)/(T{a)T{/3))x"'~^{l — xy~^ is the canonical density 
function of the beta distribution. The parameters a and f3 are set so that the 
mean and variance of the scaled beta distribution agree with the theoretical values 
H = E[D2], cr^ = Var {D2) derived in the appendix: 



a = 



a2 



UAUB - fJ- 

/3 = 

UAUB 



lx{nAnB - P)_ 



C72 



(6) 



Figure [T] shows the simulated distribution of D2 for the size typical of a small 
EST or a read produced by the 454 Titanium technology (sequence size n = 400), 
in the case of non- uniform letter distributions (77 = ^). The word sizes displayed 
in this figure are the optimal word sizes corresponding to the associated number of 
mismatches. We use the optimal word sizes computed previously in |9j. In brief, 
a word size and number mismatches combination is optimal when it best captures 
the relatedness between artificially evolved sequences using the D2 statistic as a 
relatedness estimator. 

The quantile-quantile plots between the beta and normal distributions, and the 
simulated D2 distribution show unambiguously that for these parameters combi- 
nations, the beta distribution provides a closer fit to the D2 distribution than the 
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Figure 1: Top row. Histograms of the simulated distribution of D2 for se- 
quences of size n = 400 and non-uniformly distributed letters. The normal 
distribution is shown in red and the beta distribution in blue. The insert 
shows a close up on the far right of the tail larger than the 99*'* percentile. 
Bottom row: quantile-quantile plots with the simulated D2 values horizon- 
tally, and the normal (continuous red line) and beta (dashed blue line) values 
vertically. The vertical dashed lines represent the 0.99 and the 0.9999 quan- 
tiles. 



normal distribution. Similar figures for all the simulations can be found on the au- 
thors' website (http://wwwinaths.anu.edu.au/cbis/~sf/k_words). We observed 
a few rare situations where the normal distribution outperformed the beta distri- 
bution, but these were cases where the number of mismatches was close to the 
word size, and are of little practical importance. 



2.5 Application to the detection of regulatory sequences 

We now apply the approximation of the D2 distribution to a practical biolog- 
ical problem: the identification of sequences containing cis-regulatory modules 
(CRMs). 

We use the same dataset as jl6J, which contains seven sets of sequences known 
to contain CRMs. Within each set, the CRMs are driving gene expression in 
one particular tissue or life stage. The sets contain between 9 and 82 sequences. 
For each of these 'positive' sets, a 'negative' set was constructed from randomly 
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chosen non-coding sequences of the same species, containing the same number of 
sequences and with the same sequence sizes as in the positive set. 

In [16], the authors primarily assessed whether their method can capture an 
expected effect, namely that sequences known to contain similar (CRMs) are more 
related to each other than are randomly selected sequences. While they show that 
the D2 based approach clearly outperforms other techniques, this approach is of 
limited practical use. 

We chose instead to address a problem more frequently faced by practition- 
ers: given a set of sequences known to contain CRMs, and a query sequence, can 
the query sequence be classified as containing similar CRMs or not? We set up 
the following experiment: each sequence in each positive set was selected as the 
query sequence and compared both to the remaining positive sequences of this set 
and to the corresponding negative sequences. In order that our theoretical results 
for the iid hypothesis null distribution could be applied, each sequence was pre- 
processed by (1) joining the ends to effect periodic boundary conditions and (2) 
removing masked tandem repeats present in the data sets and concatenating the 
pieces either side of the removed portion. The parameters ua and ub were taken 
from the preprocessed sequences and for each pairwise comparison the parameter 
?7 estimated from the combined letter frequencies of the two sequences in question. 
The query sequences were then screened to accept only those for which the small- 
est smallest p-value of all comparisons was less than 0.01. We used a stringent 
criterion, namely, a positive query sequence was considered correctly classified if 
the smallest p-value was obtained with another sequence of the positive set. 

Figure [2] shows the results of this experiment. A good sensitivity is achieved 
in most datasets, with typically 80% or more of the sequences correctly classified 
for at least one parameter combination using this stringent criterion. The optimal 
parameters vary from one condition to another. This may reflect different prop- 
erties of the underlying CRMs, in terms of size, letter composition and level of 
conservation that they require in order to be functional. The problem of choos- 
ing optimal parameters is easily solved by using the above approach, namely by 
determining a set of positive sequences and using these to estimate appropriate 
parameters before comparing the query sequence(s) to them. 

The percentage of correctly classified negative sequences based on the smallest 
p-value was typically around 50% (data not shown) . This suggests that while this 
method can successfully identify candidates, further validation of the candidates 
would be needed. 

3 Discussion 

In this study we present exact values and approximations of the variance of D2 
for pairs of symmetric Bernoulli texts. These results enable the distribution of D2 
to be approximated with or without mismatches for most situations occurring in 
biological research. 



7 



fly blastoderm (82) l1yPNS(23) fly tracheal (9) fly eye (17) 




0123 0123 0123 012 

# mismatches, t # mismatches, t # mismatches, t # mismatches, t 



human muscle (28) human liver (9) human HBB (17) 




0123 0123 0123 

# mismatches, t # mismatches, t # mismatches, t 



Figure 2: Percentage of times that a sequence containing CRMs is correctly 
classified: each subplot corresponds to a type of CRM, and the numbers in 
parentheses are the number of positive control sequences in each set. Per- 
centages are only plotted if at least 4 query sequences survived the screening 
requirement that the minimum p-value should be less than 0.01. 
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We illustrate the application of these results by using the D2 statistic to iden- 
tify sequences containing regulatory modules. Our results show that this method 
can be used to identify candidate regulatory sequences for further experimental 
validation, or in combination with other prediction methods. 

A remaining theoretical problem is evaluation of the variance and distribution 
of the D2 statistic in the case of approximate word matches for strings that are not 
symmetric Bernoulli texts, such as proteins. This lack of theory could be partially 
circumvented by using exact word matches for protein searches, but using alphabet 
reduction to account for most common substitutions. A similar alphabet reduction 
resulted in increased accuracy in the construction of phylogenetic trees with an 
alignment free method |14j . 

Appendix 

A Derivation of D2 mean and variance 

Define the statistic D2 to be the number of fc-word matches with up to t mismatches 
(t = 0, . . . , A;) between sequences A and B of letters drawn from an alphabet A. 
Let the sequence lengths be ua and ub respectively, and assume each sequence to 
consist of i.i.d. random letters with probability of letter a ^ A occurring at any 
given location, where Yla&A^^^ ~ ^- ^^so assume periodic boundary conditions on 
both sequences, that is, the letter in the first position in sequence A is assumed to 
follow the letter in the nj^^ position, and the letter in the first position in sequence 
B is assumed to follow the letter in the n^*^ position. 

In general, we restrict ourselves to the case of strand symmetric Bernoulli texts 
of nucleotide sequences, that is, i.i.d. sequences for which ic = = 5(1 ~'^)) = 
?T = 3(1 + where < < 1, and write the D2 statistic as D2{nA,nB,k,t,ri). 

A.l Preliminaries 

We use the following terminology adapted from [3]: 

1. For m = 1,2,..., define pm = "^aeA^"."^ • strand symmetric Bernoulli 
texts, p2 = {1 + rf)/4.. 

2. Define A(Wi,W2) to be a random variable equal to the number of mis- 
matches between the two random /c- words Wi and W2. When there is no 
possibility of confusion, we simply write A (A;) for the number of mismatches 
between the two random /c-words. One easily checks that A (A;) is a binomial 
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random variable: 

Pr (A(/c) = /) = Pr (Exactly / mismatches and k — I matches) 
'k 

'k 

'k\l_ 

1)4:^ 



(prob. of mismatch)' (prob. of match)'' ' 

{1-P2)W-' 

{'i-ri')\l+rff'K (7) 



3. y(i^j) = Yu = the approximate word match indicator, taking the value 1 if 
the number of mismatches between fc-word at i in ^ and the /c-word at j in 
B is at most t. That is: 

^fl if A((Ai,...,A,+fe_i),(Si,...,B,+fc_i)) <t 

\ otherwise ^ ' 

Note thatZ)2 = EriiE"fi%i)- 

4. gt{k,rj,c), Gt{k,ri,c), probability and cumulative distribution functions of 
the perturbed binomial distribution [21]. Given a fixed A:-word with CG- 
content c (c = 0, . . . , /c), gt{k,r],c) (resp. Gt{k,r],c)) is the probability that 
the number of mismatches between that word and a random fe-word will be 
equal to (resp. at most) t. Specifically: 



Gt{k,'n,c) = ^gr{k,r],c) (9) 

r=0 

9t{k,v,c) = h{k,r],c)ut{k,ri,c), (10) 
where < c, t < /c are integers, and 

h{k,n,c) = l-il-r^ni + ^f-'^ (11) 
ut{k,r],c) = X] (j) (yt^ i ^ 

In the above definition, we follow a convention that (") = if a < or 
a > n. 

5. We set / = ■ 1 < i < riA, 1 < j < ns}- Given u = G /, the 

dependency neighbourhood of u is defined as: 

Ju = {v = : \i' -i\<k or \j' - j\ < k}. (14) 
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Note that for v ^ Ju, Gov {Yu, Y^) = 0. Ju is divided into two parts, accor- 
dion and crabgrass defined by 

Ju = = e J« : \i' -i\<k and \j' - j\ < k} 

Ju = Ju\Ji. (15) 

The accordion set is further decomposed into a diagonal part, J^'^ and an 
off-diagonal part, J^°: 

Ju = {v = {i',j'):-k<i'-i=j'-j<k} (16) 
Ju" = J:\jf- (17) 

A. 2 Mean of D2 

An equivalent and more concise formula for -E[D2] to that given in [H] is 

E[D2{nA,nB,k,t,i])] = ^[%,)] 

t 



Y^Pv{A{k)=l) 

1=0 

nAnB|:Q(l-P2)W'-' 



nA^B fk 



4fc 



A. 3 Variance of D'l 



An exact formula for the variance of D2{nA,nB,k,0,r]) (i.e. the case of exact 
word matches) has previously been given by [TU]. The case of approximate word 
matches, < t < A;, is dealt with here. We have 

\ai {D2{nA,nB,k,t, 7])) = Var ( ^ 

\uei J 

= Gov {Yu, Y,) + Y,Y1 

uei tieJg we/ t)G Js 

= Var (1^2) I crabgrass + Var (1^2) I accordion • (19) 

Below we give an exact formula for the crabgrass part. A convenient exact formula 
for the accordion part remains intractable in general, and we give below a practical 
alternate numerical method for its evaluation. For the case of a uniform letter 
distribution, r] = 0, we demonstrate below (in section [A. 4p that only the diagonal 
part of the accordion contributes to the variance of D2, and give an exact formula 
for this case. 
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A. 3.1 Crabgrass contribution to Var {D2) 

Prom Eqs. (6) and (7) on page 9 of [3j, the crabgrass contribution is given by 
Varp2)|„abgra.s = ^ ^ Cov (F^, F.) 

k-l 

= nAUBiriA + riB - 4k + 2) ^ Var (/|,,j(W)), 

r=-fc+l 



(20) 



where, for a given {k — r)-word w € ^, 

min(r,t) 

/,(w) = Yl PH^{r) = l)Gt-i{k-r,r,,c, 
1=0 
min(r,i) 



E [,)il-P2)W-'Gt.iik-r,rj,c^) 
1=0 ^ ^ 



i=0 

where Cw is the GC-content of w. The variance with respect to the random (k — r)- 
word W is calculated using 

Var (/,(W)) = EifriW)^] - E[fr{W)f. (22) 

Since the w-dependence of the function fr is only via the GC-content of w, the 
expectation values are calculated using 

k—r 



c=0 

, — n \ ^ / 



c=0 
k—r 



E (\ ^) - + v)'-''-''Hc). (23) 

c=0 ^ ^ 



A. 3. 2 Accordion contribution to Var (D2) 
The accordion part is 

Var(I?2)Lccordion = ^ E (l^., 



u€l v&J^ 

nAnB^{k,t,ri), (24) 
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where 

fc-1 fc-1 

^k,t,v)= E Cov(y„,y„+(,,,)) (25) 

r=— fc+l s=— fc+1 



is independent of ua and ns- For the case ua = ns = 2k — I, Eq. (j20|) imphes 

Var (-D2)|crabgrass = 0, giving 

.(M..) = ^"'°-%-;:^,';r'-^-'-'"' , p^) 

which can be estimated numerically by measuring the variance of D2 for a large 
sample of pairs of sequences of length 2k — 1. Tables of ^{k, t, Vj) can be assembled 
for a range of parameters to provide a practical way of numerically calculating the 
accordion contribution. 

A. 4 Var {D2) for a uniform letter distribution 

For the case of a uniform letter distribution, = 1/d for all a G ^ where d = \ A\ is 
the alphabet size, we find that the crabgrass and off-diagonal part of the accordion 
contribution to Var (-D2) are zero, and that an analytic formula for the remaining, 
diagonal- accordion, contribution, can easily be found. 

A. 4.1 Crabgrass contribution, 77 = 

When 7? = 0, the perturbed binomial distribution reduces to the ordinary binomial 
distribution, independent of c [2T] : 



*(^-°->-G)(i)'(i)"*- 

Accordingly, the function /r(W) in Eq. [5n]is independent of the random word W, 
its variance is zero, and thus Yai {D2{nA-,nB-,k^t^d))\^^^Y)g!:ass ~ 

A. 4. 2 Diagonal-accordion contribution 

For arbitrary r/ we have (see Fig. E]) 

Var(I)2)ldiag.accordion = 2^ Cov (F^, F^) 

fc-1 

= riAUB ^ Gov (y„,y„+(r,r)) 

r=-fc+l 

fc-1 



UAUB 



Gov (y„, y„) + 2 ^ Gov {Yu, y„+(.,.)) 



r=l 



(28) 
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The covariance is 

l2 



Gov (y„, y„+(,,,)) = E[Yu, - e[YuY, (29) 



where 



-^[^j ^+{r,r)] — Pr — l)^+(r,r) — 1) 

min{/c-r,t) (-J 

^ Vi{/^{k-r) =l)^Vr{/^{r) = si)^Vr{/^{r) = 82) 

l=Q si=0 S2=0 



min(fc-r,t) /, \ f*"' / \ 

1=0 ^ ^ Ls=0 ^ ^ 



(30) 



and 



i?[yj = Pr (y„ = 1) = ^ Pr (A(fc) = = E (/ ) (1 - ^'2)^2'=-'. (31) 

z=o «=0 ^ ^ 

The l^^ term in Eq. (|3Up accounts for the event that there are up to t — Z mis- 
matches between (Aj, . . . , Aj+^-i) and (-Bj, . . . , Sj+^-i), exactly Z mismatches be- 
tween (^i+r, • • • , Aj+jfc_i) and {Bj^r, ■ ■ ■ , Bj^k-i) and up to i — / mismatches be- 
tween {Ai+k, ■■■ , A+k+r-i) and (Sj+fc, . . . , Bj+k+r-i)- 

For the case of a uniform letter distribution, one simply sets p2 = l/d in 
Eqs. ([30]) and ([SB . 



A. 4. 3 Off-diagonal-accordion contribution, rj = 

The proof that Var (£'2)loff-diag accordion = for a uniform letter distribution is 
non-trivial. First we establish some general results about the distance function 
A(Wi,W2), equal to the number of mismatches between two random /c-words 
Wi and W2. 

For a uniform letter distribution, and for two independent (i.e. non-overlapping) 
words Wi and W2, we have from Eq. ([7|) 

Pr(A(Wi,W2) = = Q(1-P2)V-' = 0^^^- (32) 

If one of the words is fixed to be w, one easily checks that the conditional proba- 
bility is also binomial: 

Pr (A(Wi, W2) = /|W2 = w) = (^^) = Pr (A(Wi, W2) = /) . (33) 

Thus A(Wi, W2) and W2 are independent random variables. 
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Now consider the case of three independent random words Wi, W2 and W3. 
Then 

Pr (A(Wi, W2) = /i, A(W2, W3) = /2) 
= ^ Pr(A(Wi,W2) = /i|W2 = w) 

xPr(A(W2, W3) = /2IW2 = w)Pr(W2 = w) 
= Yl Pr(A(Wi,W2) = /i)Pr(A(W2,W3) = /2)j^ 

= Pr(A(Wi,W2) = Zi)Pr(A(W2,W3) = /2) (34) 

where we have used the fact that, once W2 is fixed, A(Wi, W2) and A(W2, W3) 
depend only on Wi and W3 respectively, and so are effectively independent. 

We now generalise Eqs. (|33]) and (fM]l to the following proposition Pjv, which 
will be proved by induction: 

For given N > 2, let Wi, . . . , W^v+i be mutually independent /c- words, and define 

A, = A(Wi,W,), i = l,...,N. (35) 

Then for any w G A^, 

Pr (Ai = /i,...,Ajv-i = /7V-i|WAr = w) 
= Pr (Ai = /i,...,Ajv-i = /jv-i) (36) 

and 

Pr (Ai = /i,...,A^ = Z^) 

= Pr (Ai = Zi,...,AAr_i=/7v-i)Pr(A^ = /;v). (37) 

Note that Eq. ([36|) could equivalently be written as 

Pr (Ai = /i, . . . , Ajv-i = In^i\Wn G R) 

= Pr (Ai = /i,...,Ajv-i =/jv~i), (38) 

where R C is any restricted set of fc-words. Note also that combining Eq. (j37|) 
for the propositions P2 to P/v implies 

Pr (Ai = Zi, . . . , A;v = In) = Pr (Ai = /i) x . . . x Pr (A;v = In) (39) 

The proposition P2 is proved by Eqs. (j33p and (j34p . It remains to prove that 
Pn implies Pn+i- Define S(w, Z) = {x € ^'^|A(x, w) = /}. Starting with the left 
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hand side of Eq. (j36p with N replaced by + 1, we have 

Pr (Ai = Zi, . . . , Atv = ^7v|W7v+i = w) 
= Pr (Ai = /i, . . . , Atv = In\'Wn+i = w, g S{w, In)) 
xPr (W^ g5(w,/^)) 
+Pr (Ai = h,...,AN = In\Wn+i = w, Wjv ^ S{w, In)) 
xPr (Wn ^S{^v,In)) 
= Pr (Ai = /i, . . . , A^_i = In-1 {Wn+i = w, g 5(w, /jv)) 
xPr (Wn €S{^v,In)), 

where the second term is zero since "Aat = In" and "W^v ^ S{w,In)" are mu- 
tually exclusive events, and the requirement "A^v = In" has been dropped from 
the first term since it is automatically satisfied by the condition "W^v+i = w 
and Wjv G »S'(w, In)" ■ Then, since Ai, . . . , A^r-i are independent of W^v+i, and 
rewriting the second factor, we have 

Pr (Ai = /i, . . . , Aat = /Ar|W^+i = w) 

= Pr (Ai = h,..., An-1 = In-i\V^n G S(w, In)) 

xPr (Aat = ZatIWat+i = w) 
= Pr (Ai = Zi,... , Ajv_i = /Ar_i)Pr (Aat = /jv) by Eqs.(l33D and dM]) 
= Pr (Ai = /i,...,A;v = ^w) byEq. jSTj) (40) 

which establishes the first part of proposition Pn+i- Starting with the left hand 
side of Eq. ([37]) with N replaced + 1, 

Pr (Ai = /i, . . . , Aat+i = In+i) 
= ^ Pr (Ai = Zi, . . . , Atv = /tvIWtv+i = w) 

xPr (Aat+i = In+i\^n+i = w)Pr (Wat+i = w) 
= Pr (Ai = /i,...,Ajv = /jv)Pr (A^+i = /^+i)^ by Eq. (00]) 

= Pr (Ai = /i,...,A^ = /7v)Pr (Aat+i = ?jv+i), (41) 

which establishes the second half of proposition Pn+i- ^ 
We are now in a position to calculate 

Var(Z)2)| 

off —diag. accordion 

= E E Cov(y,,n). (42) 

Writing u = {i,j), v = (z + r, j + s) G J"°, the off-diagonal part can be 
subdivided into six parts illustrated in Fig. [3l namely 

Aside: For an alternate proof that Var {D2{nA, riB, k, t, 0))|j.j.jj,^g,.j^gg = one can apply 
the above proposition to the third line of Eq. (5) of [3]. 
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-k+ 1, 



-1, 0, 1, 



,k-\ 



-k+ 1 



k-\ 




Figure 3: The main diagonal and off-diagonal regions I to VI of J". 



I: 0<s<r<A;-l; 

II: -A; + 1 < s < r < 0; 

III: -A; + l<r<s<0; 

IV: < r < s < A; - 1; 

V: 1 < r < /c - 1, -A; + 1 < s < -1; 

VI: 1 < s < A; - 1, -/c + 1 < r < -1. 

We proceed to prove that Gov (Yu,Yy) vanishes for each of the six cases. 

Case I is ihustrated in Fig. IHa). The union of the overlapping words = 
i+k-l ) and W^;^ = {Ai+r^---,A 

j^fc+r-i) is subdivided into the shaded 
pieces Wg ' = {Ai, ... , Aj+^-i) and Wq ' = {Ai+k+r-s, ■■■ , Ai+k+r-i) each of 
length s, and a set of and a set of single-letter words = {Ai^s+a~i), ol = 
l,...,k + r -2s. 

Similarly, the union of the overlapping words = {Bj, . . . , Bj^k-i) and 
= {Bj^s, ■ ■ ■ , Bj+k+s-i) is subdivided into the shaded pieces W^'"^ = iBi,...,Bj+s-i) 
and Wq ' = {Bj^k, ■ ■ ■ , Bj^k+s-i) each of length s, and a set of and a set of 
single-letter words = (Bj^s+a-i), a = I, . . . ,k — s. 

Define 



A« = A(Wo"'«,Wo^'^) 



= AfW^ W^) A"- = AfW^ W 

'-^a '-'V**a) '-^a '-'V a ) *' 



R 



a+r—sji 



a 



,k 
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Figure 4: (a) Case I of the off-diagonal accordion contribution to Var(D2)- 
Cases II, III and IV are obtained by reflection or by interchanging the roles 
of A and B. (b) Case V of the off-diagonal contribution. Case VI is obtained 
by interchanging the roles of A and B. 
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Then 

k~s k—s 

A(Wi^,Wf) = J]A^, A(Wi^,Wf) = ^Af. (43) 

With the indicator variables Yu and defined as above, we have 

E{Yu,Y,)=Fi{Yu = l,Y, = l) 

= Pr (A(W^, ) < t,A(WtW^) < t) 

= Yl E Pr(A^ = mo,... 

. . . , A^_, = mk-s, A^ = /o, . . . , Af_, = Ik-s) , (44) 
where the index set summed over is 

It = llo,..., h-s 0<lo<s,0<h,..., k-s < 1, ^ 4 ■ ^^^^ 

The set {Af, . . . , A^_g, Af , . . . , partitions into a collection of disjoint sub- 

sets of the form {A^, A^, A^+.^_^, A^+^_^, A^_^2(r-s)' • ■ ■}, a = 1, . . . ,r - s (indi- 
cated by the zig-zag line in Fig. [Ula)), each of which satisfies the conditions of the 
proposition P/v for some A^. Also, these subsets are independent of one another 
and of Ag and Ag^, since they contain random variables which are functions of 
corresponding disjoint subsets of letters. 

Thus we can factor the probability in Eq. (|44p and rearrange the sum to obtain 

E{Yu, Y,) = ^ Pr (A^ = mo) . . . Pr (At. = m^.,) 

{mo,.--,mfc-s}G/t 

X Pr (Ao^ = /o)...Pr (Af_, = 

{mo,...,m^:_s}€lt 

X Y Pr (Ao^ = Zo,...,At. = M 

= Pr (A(W;^, W,f ) < t) Pr (A(W,^, W,f ) < t) 

= E{Yu)E{Y,). (46) 

Thus Gov (y^j, y^,) = for V in the Case I part of J^". Cases II, III and IV can be 
similarly dealt with by reversing the order of both sequences, interchanging the 
roles of sequences A and or both. 

Case V is illustrated in Fig. [D^b). This time the union of the overlapping words 
and is subdivided into the set of single-letter words = (Aj+Q.i), 
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a = 1, . . . ,k+r, and the union of the overlapping words and is subdivided 
into the set of single-letter words = (i?j_|s|+a-i), a = 1, . . . , fc+ We define 

= A(W^, Wf_,|,|), A« = A(Wf , W^+,), a = l,...,k. 

Then 

k k 

A(Wi^,Wf) = ^A^, A{WtW^) = Y.^^^ (47) 

0=1 a=l 

and 

E{Yu,Y,) = Pr(y„ = i,y, = 1) 

= Pr (A(Wi^,Wf)<t,A(Wi^,Wf)<t) 

= E E Pr(Af = mi,... 

{mi,...,mfc}e/t {h,...,lk}&It 

...,A^ = mfc,Af = Zi,...,Af = ^fc), (48) 

where the index set is now 



k 

0</i,...,/fc<l,^/„<t I 

a=l ) 



(49) 



The set {A^^, . . . , A^, Af , . . . , A^} partitions into a collection of disjoint sub- 
sets of the form {A^, Af_^|^|, A^_^^_^|^|, . . .}, a = 1,... ,r, or 

{^2) ^a+l^l' Af+r+|s|' • • •}, a = 1, • • • , |s| (indicated by the zig-zag line in Fig. Stb)), 
each of which satisfies the conditions of the proposition P/v for some N ^ and which 
are mutually independent. Thus we can factor the probability in Eq. (j48p . rearrange 
the sum and recombine the probabilities to obtain 

E{Yu,Y,) = Yl Pr (Af = mi,...,Af =mfc) 

{mi,...,mh}elt 

X Yl Pr (Af = /i,...,A« = /fc) 

{h,-,ik}eit 

= Pr {A{WtW^) < t) Pr {A{W^,W^) < t) 

= E{Yu)E{Y,), (50) 

giving Gov (1^, Y^,) = for v in the Case V part of J°°. Case VI can be similarly 
dealt with by interchanging the roles of sequences A and B. 
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